Utility of pre-treatment FDG PET/CT–derived machine learning models for outcome prediction in classical Hodgkin lymphoma

Objectives Relapse occurs in ~20% of patients with classical Hodgkin lymphoma (cHL) despite treatment adaptation based on 2-deoxy-2-[18F]fluoro-d-glucose (FDG) positron emission tomography/computed tomography (PET/CT) response. The objective was to evaluate pre-treatment FDG PET/CT–derived machine learning (ML) models for predicting outcome in patients with cHL.

Methods All cHL patients undergoing pre-treatment PET/CT at our institution between 2008 and 2018 were retrospectively identified. A 1.5 × mean liver standardised uptake value (SUV) threshold and a fixed 4.0 SUV threshold were used to segment PET/CT data. Feature extraction was performed using PyRadiomics with ComBat harmonisation. Training (80%) and test (20%) cohorts stratified around 2-year event-free survival (EFS), age, sex, ethnicity and disease stage were defined. Seven ML models were trained and hyperparameters tuned using stratified 5-fold cross-validation. Area under the curve (AUC) from receiver operating characteristic analysis was used to assess performance.

Results A total of 289 patients (153 males), median age 36 (range 16–88 years), were included. There was no significant difference between the training (n = 231) and test (n = 58) cohorts (p value > 0.05). A ridge regression model using a 1.5 × mean liver SUV segmentation had the highest performance, with mean training, validation and test AUCs of 0.82 ± 0.002, 0.79 ± 0.01 and 0.81 ± 0.12. However, there was no significant difference between a logistic model derived from metabolic tumour volume and clinical features and the highest performing radiomic model.

Conclusions Outcome prediction using pre-treatment FDG PET/CT–derived ML models is feasible in cHL patients. Further work is needed to determine optimum predictive thresholds for clinical use.

Key points
• A fixed threshold segmentation method led to more robust radiomic features.
• A radiomic-based model for predicting 2-year event-free survival in classical Hodgkin lymphoma patients is feasible.
• A predictive model based on ridge regression was the best performing model on our dataset.

Supplementary Information The online version contains supplementary material available at 10.1007/s00330-022-09039-0.


Background and objectives 3a
Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models.

3b
Specify the objectives, including whether the study describes the development or validation of the model or both.

Source of data 4a
Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable.

4b
Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up.

Participants 5a
Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centres.

5b
Describe eligibility criteria for participants.

5c
Give details of treatments received, if relevant.

Outcome 6a
Clearly define the outcome that is predicted by the prediction model, including how and when assessed.

6b
Report any actions to blind assessment of the outcome to be predicted.

Predictors 7a
Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured.

7b
Report any actions to blind assessment of predictors for the outcome and other predictors.

Sample size 8
Explain how the study size was arrived at.

Missing data 9
Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method.

Participants 13a
Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful.

13b
Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome.

Model development 14a
Specify the number of participants and outcome events in each analysis.

This study looks at the utility of pre-treatment FDG PET/CT derived machine learning models for outcome prediction in classical Hodgkin lymphoma. (Title)
The abstract covers a summary of all the requested information. (Abstract)
a) The introduction presents the background of HL and sets out the aim of creating a predictive model using machine learning techniques and radiomic features derived from the baseline PET/CT. Two previous papers that aimed to create a similar model are discussed. (Introduction)
b) The aim of this study was to create a predictive model using radiomic features derived from pre-treatment FDG PET/CT to predict 2-year EFS in HL patients using a larger tertiary centre cohort of patients. (Introduction)
a) This is a retrospective single centre cohort study. The study cohort was randomised in a ratio of 4:1 into training and testing cohorts stratified around 2-EFS, age, gender, ethnicity and disease stage. (Patient selection)
b) Consecutive patients with histologically proven cHL who underwent baseline FDG-PET/CT at a single large tertiary referral centre between June 2008 and January 2018 were included. The follow-up information recorded is set out in the patient selection section. (Patient selection)
a) This is a single tertiary centre study. (Patient selection)
b) Patients were excluded if they were under 16 years of age, did not have cHL, had treatment prior to their staging PET/CT study, did not have measurable disease on PET/CT, had a concurrent malignancy, did not have disease above 4.0 SUV, had hepatic involvement or if the images were degraded or incomplete. (Patient selection)
c) The treatment regimen for the cohort is set out in Table 2. No change to departmental standard treatment was made. (Table 2)
a) An event was defined as relapse, recurrence or death within the 2-year follow-up period. (Patient selection)
b) As this was a retrospective study, the primary outcomes were defined from clinical records. The investigator reviewing the records was blinded to the imaging parameters.
a) The description of the contouring method, resampling, harmonisation, radiomic feature extraction and the methods used for feature selection are documented within the method section and Supplementary Material 2. The features selected as part of the models are described in Table 3. (Materials and methods, Supplementary Material 2, Results)
b) The images were contoured and analysed without reference to the outcome data.
All patients who met the inclusion criteria were included. The cut-off of January 2018 was chosen to allow for 2-year follow-up without confounding factors introduced by the COVID-19 pandemic. For feature selection, five features were chosen as the maximum number of features to be included in each model. This was derived from 10 events per parameter, with 54 events within the training cohort. (Materials and methods, Results)
Only complete data sets were used in the analysis. (Results)
a) Clinical factors were included in the variable selection process alongside radiomic features. The categorical data were dummy encoded. Continuous features were normalised using a standard scaler. (Machine learning analysis, Supplementary Material 2)
b) Random forest, support vector machine, logistic regression, k-nearest neighbour, single layer perceptron, multilayer perceptron and Gaussian process classifier models were trained and tuned on the training cohort using cross-validation. The models were created using different feature selection methods; the bin width or bin number was selected based on the method which yielded the most robust features (intraclass correlation coefficient >0.8) following re-segmentation. A model was generated using radiomic features from each of a fixed 4.0 SUV threshold segmentation technique and a 1.5 × mean liver SUV threshold segmentation technique. A model was also created using metabolic tumour volume. The models with the highest mean receiver operating characteristic (ROC) area under the curve (AUC) were tested and compared on the unseen test cohort. (Machine learning analysis, Supplementary Material 2)
d) When comparing models, the mean validation AUC was used to determine the best performing model. A DeLong test was used to compare the AUCs of the test set. (Machine learning analysis, Supplementary Material 2)
Risk groups were not created within the model.
a) 289 patients were included, with demographics detailed in Table 2. (Results)
b) The characteristics of the participants are presented in Table 2.
a) The number of events per cohort is presented in Table 2.
b) This has not been performed. The training and testing cohorts were stratified around key clinical features, but the results are not adjusted for these. Further analysis was performed looking at how the model performed on patients treated as having advanced disease.
a/b) The features and hyperparameters used to create the model are presented in the Clinical and radiomic model for the prediction of 2-EFS section.
The mean validation and test ROC curves are presented. The 95% confidence intervals are presented.
The limitations of the study are presented. These include the retrospective nature of the study, the relatively low number of events, reliance on clinical records, the exclusion of patients with hepatic disease or disease not meeting the 4.0 SUV threshold, variation in patient treatment and the lack of external validation. (Discussion)
b)/20. The discussion section gives an overall interpretation of the results and highlights the potential use of a pre-treatment model to aid in early personalised treatment for patients. (Discussion)
The Python libraries used are referenced within the text. The radiomic features extracted using PyRadiomics are detailed in Supplementary Material Table 2.
The study was not externally funded. Individual authors' funding is declared within the Declaration.
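
The cap of five features quoted in the sample size response above follows from simple arithmetic on the events-per-parameter rule. The sketch below reproduces that calculation; the function name `max_features` is hypothetical and is not taken from the study code.

```python
def max_features(n_events: int, events_per_parameter: int = 10) -> int:
    """Largest whole number of model parameters allowed by an events-per-parameter rule."""
    if n_events < 0 or events_per_parameter <= 0:
        raise ValueError("n_events must be >= 0 and events_per_parameter must be > 0")
    # Floor division: a partially supported parameter does not count.
    return n_events // events_per_parameter


# 54 events in the training cohort at 10 events per parameter.
cap = max_features(54)
print(cap)  # 5
```

Floor division is used so that the cap is conservative when the number of events is not a multiple of ten.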

Image segmentation
Image data were viewed and contoured using specialised multimodality imaging software (RTx v1.8.2, Mirada Medical). Lymphomatous disease segmentation was performed by a clinical radiologist with 6 years' experience and a research radiographer with 2 years' experience of segmenting cross-sectional imaging, and reviewed by two dual-certified Radiology and Nuclear Medicine Physicians with >15 years' experience of oncological PET/CT interpretation. Any discrepancies were resolved by consensus. Two segmentation techniques were used to contour disease sites on PET: the first used a fixed threshold of 4.0 SUV and the second a threshold of 1.5 × liver SUVmean, a method that has been used in different cancer types [16,17]. The mean liver SUV was determined by placing a 110 cm³ region of interest in the right lobe of the liver. The contour from the PET was translated to the co-registered unenhanced low-dose CT component of the study, with the contours matched to soft tissue with a value of -10 to 100 Hounsfield units (HU). Contours were exported as digital imaging and communications in medicine (DICOM) radiotherapy (RT) structures. Ten percent of the cases were re-segmented using the same methodology by the radiologist who performed the initial segmentation, after a 3-month washout period, using Slicer (v4.11). These segmentations were used to test the repeatability of the segmentation techniques and the robustness of the extracted features.
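
The two thresholding rules can be illustrated on a toy SUV volume. This is a minimal sketch only: the study performed segmentation in dedicated imaging software with expert review, whereas the code below uses a synthetic NumPy array, hypothetical function names (`liver_threshold`, `segment`), and assumes that voxels at or above the threshold are included.

```python
import numpy as np


def liver_threshold(suv: np.ndarray, liver_mask: np.ndarray, factor: float = 1.5) -> float:
    """Threshold defined as factor x mean SUV inside a liver region of interest."""
    return factor * float(suv[liver_mask].mean())


def segment(suv: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask of voxels at or above the SUV threshold (inclusive by assumption)."""
    return suv >= threshold


# Synthetic 4 x 4 x 4 SUV volume; values are illustrative, not patient data.
suv = np.full((4, 4, 4), 1.0)
suv[0, :2, :2] = 2.0    # stand-in liver region with mean SUV 2.0
suv[2:, 2:, 2:] = 5.0   # hot lesion of 8 voxels
suv[3, 0, 0] = 3.5      # uptake lying between the two thresholds

liver_mask = np.zeros(suv.shape, dtype=bool)
liver_mask[0, :2, :2] = True

fixed_mask = segment(suv, 4.0)              # fixed 4.0 SUV threshold
liver_t = liver_threshold(suv, liver_mask)  # 1.5 x 2.0 = 3.0
liver_based_mask = segment(suv, liver_t)    # 1.5 x liver SUVmean threshold

print(int(fixed_mask.sum()), liver_t, int(liver_based_mask.sum()))
```

In this toy case the liver-based threshold is lower than 4.0 SUV, so it captures the intermediate-uptake voxel that the fixed threshold excludes. Multiplying the voxel count by the voxel volume would give a metabolic tumour volume for each mask.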

Machine learning analysis
The study cohort was split into training and test cohorts stratified around 2-year EFS (2-EFS), age, sex, ethnicity, stage of disease, having radiotherapy, having ABVD-based chemotherapy and being treated as advanced disease using scikit-learn (v0.24.2). Ethnicity was defined by the information volunteered by patients. Given the low numbers in some ethnic groups, it was not possible to stratify the training and test cohorts around ethnicity without grouping the data into Caucasian and non-Caucasian ethnic groups.
The cohorts were split using an 80:20 ratio. Mann-Whitney U and χ² tests (SciPy v1.6.3) were used to assess for significant differences in continuous and categorical clinical characteristics, respectively, between the training and test cohorts. A p-value less than 0.05 was regarded as significant. Categorical data were dummy encoded (Pandas v1.2.4), and continuous data were normalised using a standard scaler (scikit-learn).

Missing clinical data meant that a comparison with commonly utilised clinical scoring methods was not possible, and the treatment regimen used was taken as a surrogate indicator of whether the patient was deemed to have early or advanced disease.

Eur Radiol (2022)

1: detailing the radiomic features extracted for both the PET and CT components. The equations for the features can be found at https://pyradiomics.readthedocs.io/en/latest/features.html. GLCM = grey level co-occurrence matrix, GLDM = grey level dependence matrix, GLRLM = grey level run length matrix, GLSZM = grey level size zone matrix, NGTDM = neighbouring grey tone difference matrix, Id = inverse difference, Idn = inverse difference normalised, Imc = informational measure of correlation, Idm = inverse difference moment, Idmn = inverse difference moment normalised, MCC = Matthews correlation coefficient. Each of the first- and second-order features was extracted from the original images and then from the images after filters had been applied. The filters used were: wavelet (LLL, LLH, LHL, LHH, HHH, HLH, HHL, HLL); log-sigma (1.0, 2.0, 3.0, 4.0); square; square root; logarithm; exponential; gradient; lbp-3D (m1, m2, k).
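
The stratified 80:20 split, dummy encoding, standard scaling and stratified 5-fold cross-validation described in this section can be sketched with scikit-learn and pandas. This is an illustration under stated assumptions, not the study pipeline: the cohort is synthetic, the column names (`event`, `sex`, `age`, `mtv`) are hypothetical, stratification uses only two variables combined into one key, and an L2-penalised logistic regression stands in for the models that were actually tuned.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in cohort (hypothetical columns, random feature values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "event": np.tile([0, 0, 0, 1], n // 4),          # 25% event rate
    "sex": np.tile(["F"] * 4 + ["M"] * 4, n // 8),
    "age": rng.normal(40, 15, n),
    "mtv": rng.lognormal(4, 1, n),
})

# Dummy-encode the categorical column; continuous columns pass through unchanged.
X = pd.get_dummies(df[["sex", "age", "mtv"]], columns=["sex"],
                   drop_first=True, dtype=float)
y = df["event"]

# Stratifying on several variables at once: combine them into a single key.
strata = df["event"].astype(str) + "_" + df["sex"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=strata, random_state=0
)

# Scale only the continuous features; the scaler is fitted inside each CV fold,
# so no information from held-out data leaks into the scaling.
scale_continuous = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "mtv"])], remainder="passthrough"
)
# LogisticRegression applies an L2 penalty by default.
model = make_pipeline(scale_continuous, LogisticRegression(C=1.0))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
val_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"validation AUC: {val_auc.mean():.2f} ± {val_auc.std():.2f}")
```

Because the synthetic features are random, the validation AUC here carries no meaning; the point is the order of operations. Keeping the scaler inside the pipeline is a design choice that prevents the held-out fold from influencing the normalisation.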